The objective of this task is to help the Category Manager for Chips in a grocery store better understand the types of customers who purchase chips and their purchasing behaviour within the region.
To accomplish this we are provided with two datasets: one year of potato chip transactions and customer details.
We are going to prepare these datasets for analysis by cleaning and merging them together. We then proceed to create any extra features we deem necessary for the analysis.
Once the final dataset is ready, we can analyse the data with the goal of understanding which segments are the biggest drivers of sales and what their purchasing preferences are, so that we can create recommendations for the category manager.
# Import libraries to be used in this section
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
# Load the data
data_t = pd.read_excel("./Original Data/QVI_transaction_data.xlsx")
data_c = pd.read_csv("./Original Data/QVI_purchase_behaviour.csv")
# Sample of the transaction dataset
data_t.sample(5)
| DATE | STORE_NBR | LYLTY_CARD_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES | |
|---|---|---|---|---|---|---|---|---|
| 126598 | 2018-07-04 | 132 | 132018 | 135874 | 104 | Infuzions Thai SweetChili PotatoMix 110g | 2 | 7.6 |
| 207007 | 2019-06-25 | 63 | 63170 | 60489 | 82 | Smith Crinkle Cut Mac N Cheese 150g | 2 | 5.2 |
| 61881 | 2018-08-12 | 114 | 114241 | 118425 | 7 | Smiths Crinkle Original 330g | 2 | 11.4 |
| 149022 | 2018-09-09 | 243 | 243435 | 247202 | 47 | Doritos Corn Chips Original 170g | 2 | 8.8 |
| 3031 | 2019-01-17 | 157 | 157164 | 159430 | 26 | Pringles Sweet&Spcy BBQ 134g | 2 | 7.4 |
# Data Info
data_t.info()
print('\n')
print('{:,} rows; {:,} columns'
.format(data_t.shape[0], data_t.shape[1]))
print('Transactions timeframe from {} to {}'.format(data_t['DATE'].min(),
data_t['DATE'].max()))
<class 'pandas.core.frame.DataFrame'> RangeIndex: 264836 entries, 0 to 264835 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DATE 264836 non-null datetime64[ns] 1 STORE_NBR 264836 non-null int64 2 LYLTY_CARD_NBR 264836 non-null int64 3 TXN_ID 264836 non-null int64 4 PROD_NBR 264836 non-null int64 5 PROD_NAME 264836 non-null object 6 PROD_QTY 264836 non-null int64 7 TOT_SALES 264836 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(5), object(1) memory usage: 16.2+ MB 264,836 rows; 8 columns Transactions timeframe from 2018-07-01 00:00:00 to 2019-06-30 00:00:00
We have 264,836 transactions from July 2018 to June 2019.
We need to convert columns 1 to 5 (the store, loyalty card, transaction, and product identifiers) to the categorical data type.
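Casting low-cardinality identifier columns to `category` can cut memory use substantially, since pandas stores small integer codes instead of full int64 values. A quick sketch on toy data (the column name is just illustrative):

```python
import numpy as np
import pandas as pd

# A low-cardinality identifier repeated across many rows (272 stores)
df = pd.DataFrame({"store_nbr": np.random.randint(1, 273, size=100_000)})
before = df.memory_usage(deep=True).sum()

# Category dtype stores one small integer code per row plus the unique values
df["store_nbr"] = df["store_nbr"].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"int64: {before:,} bytes -> category: {after:,} bytes")
```

Note that near-unique columns such as `TXN_ID` gain little from this, since the list of unique categories is almost as large as the column itself.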
# Transform categorical features
data_t = data_t.astype({"LYLTY_CARD_NBR": 'category', "STORE_NBR": 'category', "TXN_ID": 'category',
"PROD_NBR": 'category', "PROD_NAME": 'category'})
# Debug
data_t.dtypes
DATE datetime64[ns] STORE_NBR category LYLTY_CARD_NBR category TXN_ID category PROD_NBR category PROD_NAME category PROD_QTY int64 TOT_SALES float64 dtype: object
# Check for duplicates
data_t.duplicated().sum()
# Remove duplicates (assign the result back; drop_duplicates is not in-place)
data_t = data_t.drop_duplicates()
data_t
| DATE | STORE_NBR | LYLTY_CARD_NBR | TXN_ID | PROD_NBR | PROD_NAME | PROD_QTY | TOT_SALES | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-10-17 | 1 | 1000 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 |
| 1 | 2019-05-14 | 1 | 1307 | 348 | 66 | CCs Nacho Cheese 175g | 3 | 6.3 |
| 2 | 2019-05-20 | 1 | 1343 | 383 | 61 | Smiths Crinkle Cut Chips Chicken 170g | 2 | 2.9 |
| 3 | 2018-08-17 | 2 | 2373 | 974 | 69 | Smiths Chip Thinly S/Cream&Onion 175g | 5 | 15.0 |
| 4 | 2018-08-18 | 2 | 2426 | 1038 | 108 | Kettle Tortilla ChpsHny&Jlpno Chili 150g | 3 | 13.8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 264831 | 2019-03-09 | 272 | 272319 | 270088 | 89 | Kettle Sweet Chilli And Sour Cream 175g | 2 | 10.8 |
| 264832 | 2018-08-13 | 272 | 272358 | 270154 | 74 | Tostitos Splash Of Lime 175g | 1 | 4.4 |
| 264833 | 2018-11-06 | 272 | 272379 | 270187 | 51 | Doritos Mexicana 170g | 2 | 8.8 |
| 264834 | 2018-12-27 | 272 | 272379 | 270188 | 42 | Doritos Corn Chip Mexican Jalapeno 150g | 2 | 7.8 |
| 264835 | 2018-09-22 | 272 | 272380 | 270189 | 74 | Tostitos Splash Of Lime 175g | 2 | 8.8 |
264835 rows × 8 columns
# Change column names to lowercase
data_t.rename(columns=lambda x: x.lower(), inplace=True)
# Debug
data_t.head()
| date | store_nbr | lylty_card_nbr | txn_id | prod_nbr | prod_name | prod_qty | tot_sales | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-10-17 | 1 | 1000 | 1 | 5 | Natural Chip Compny SeaSalt175g | 2 | 6.0 |
| 1 | 2019-05-14 | 1 | 1307 | 348 | 66 | CCs Nacho Cheese 175g | 3 | 6.3 |
| 2 | 2019-05-20 | 1 | 1343 | 383 | 61 | Smiths Crinkle Cut Chips Chicken 170g | 2 | 2.9 |
| 3 | 2018-08-17 | 2 | 2373 | 974 | 69 | Smiths Chip Thinly S/Cream&Onion 175g | 5 | 15.0 |
| 4 | 2018-08-18 | 2 | 2426 | 1038 | 108 | Kettle Tortilla ChpsHny&Jlpno Chili 150g | 3 | 13.8 |
# Let's look for outliers in total sales and quantity
plt.scatter(x='prod_qty', y='tot_sales', data=data_t);
# Let's remove outliers using the IQR method
# 1. Calculate IQR
Q1 = data_t.tot_sales.quantile(0.25)
Q3 = data_t.tot_sales.quantile(0.75)
IQR = round(Q3-Q1, 3)
Lower_Whisker = round(Q1-1.5*IQR, 3)
Upper_Whisker = round(Q3+1.5*IQR, 3)
# 2. Keep only sales below the upper whisker
data_t = data_t[data_t['tot_sales'] < Upper_Whisker]
# Debug outliers
plt.grid(False)
sns.boxplot(data=data_t, x=data_t.tot_sales)
plt.scatter(x='prod_qty', y='tot_sales', data=data_t)
Let’s check the products for consistency, i.e. make sure we don’t have any non-chip products in there.
# Clean product names (collapse repeated spaces)
import re
data_t['prod_name'] = data_t.prod_name.apply(lambda x: re.sub(' +', ' ', x))
# Remove numbers from product names to check for the most common words
names = data_t.prod_name.apply(lambda x: "".join(
filter(lambda x: not x.isdigit(), x)))
# Get series with individual words to count
check = names.str.split()
# Import Counter
from collections import Counter
# Create counter function
def counter(series):
    cnt = Counter()
    for words in series:
        for word in words:
            cnt[word] += 1
    return cnt
# Check if we have unexpected words
counter(check).most_common(20)
[('g', 246091),
('Chips', 49688),
('Kettle', 41141),
('&', 35490),
('Smiths', 28810),
('Salt', 27924),
('Cheese', 27809),
('Pringles', 25052),
('Doritos', 24896),
('Crinkle', 23938),
('Corn', 22014),
('Original', 21510),
('Cut', 20751),
('Chip', 18596),
('Salsa', 18054),
('Chicken', 15374),
('Sea', 14121),
('Thins', 14049),
('Sour', 13853),
('Crisps', 12583)]
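As an aside, the same word count can be produced in one pass with `itertools.chain` instead of the explicit double loop (an equivalent sketch, using a small stand-in for the `check` Series):

```python
from collections import Counter
from itertools import chain

# Small stand-in for the `check` Series of word lists
check = [["Kettle", "Chips"], ["Smiths", "Crinkle", "Chips"]]

# Flatten all the word lists into one stream and count in a single pass
cnt = Counter(chain.from_iterable(check))
print(cnt.most_common(2))
```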
Salsa appears a lot; these could be salsa-flavoured chips or actual salsa products. Let's confirm.
# Create salsa dataframe
salsa = data_t[data_t['prod_name'].str.contains('Salsa')]
# Check the list of products
salsa_names = salsa.prod_name.unique()
list(salsa_names)
['Old El Paso Salsa Dip Tomato Mild 300g', 'Red Rock Deli SR Salsa & Mzzrlla 150g', 'Smiths Crinkle Cut Tomato Salsa 150g', 'Doritos Salsa Medium 300g', 'Old El Paso Salsa Dip Chnky Tom Ht300g', 'Woolworths Mild Salsa 300g', 'Old El Paso Salsa Dip Tomato Med 300g', 'Woolworths Medium Salsa 300g', 'Doritos Salsa Mild 300g']
After some googling, these products are indeed salsa rather than chips, so let's drop all rows with these products.
# Debug
prod = data_t.prod_name.count()
remove = salsa.prod_name.count()
result = prod - remove
print('After we remove the {} salsa products we should have {} products'.format(
remove, result))
After we remove the 18054 salsa products we should have 246204 products
# Get the index numbers of rows to drop
index = data_t[data_t['prod_name'].isin(salsa_names)].index.values
len(index)
18054
# Drop the salsa rows
data_t = data_t.drop(index)
# Debug
data_t.prod_name.count()
246204
Looks like we have solved the data quality issues in the transactions dataset.
Now on to the customers dataset.
# Get a sample of the customers dataset
data_c.sample(5)
| LYLTY_CARD_NBR | LIFESTAGE | PREMIUM_CUSTOMER | |
|---|---|---|---|
| 32852 | 121237 | YOUNG SINGLES/COUPLES | Budget |
| 31605 | 116132 | OLDER SINGLES/COUPLES | Budget |
| 53637 | 201227 | MIDAGE SINGLES/COUPLES | Budget |
| 18076 | 66096 | RETIREES | Budget |
| 65390 | 245192 | YOUNG SINGLES/COUPLES | Premium |
data_c.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 72637 entries, 0 to 72636 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LYLTY_CARD_NBR 72637 non-null int64 1 LIFESTAGE 72637 non-null object 2 PREMIUM_CUSTOMER 72637 non-null object dtypes: int64(1), object(2) memory usage: 1.7+ MB
# Change column names to lowercase
data_c.rename(columns=lambda x: x.lower(), inplace=True)
# Change life stage capitalization
data_c.lifestage = data_c.lifestage.apply(lambda x: x.capitalize())
# Check for duplicates
data_c.duplicated().sum()
# Remove duplicates (assign the result back; drop_duplicates is not in-place)
data_c = data_c.drop_duplicates()
data_c
| lylty_card_nbr | lifestage | premium_customer | |
|---|---|---|---|
| 0 | 1000 | Young singles/couples | Premium |
| 1 | 1002 | Young singles/couples | Mainstream |
| 2 | 1003 | Young families | Budget |
| 3 | 1004 | Older singles/couples | Mainstream |
| 4 | 1005 | Midage singles/couples | Mainstream |
| ... | ... | ... | ... |
| 72632 | 2370651 | Midage singles/couples | Mainstream |
| 72633 | 2370701 | Young families | Mainstream |
| 72634 | 2370751 | Young families | Premium |
| 72635 | 2370961 | Older families | Budget |
| 72636 | 2373711 | Young singles/couples | Mainstream |
72637 rows × 3 columns
The customer dataset looks good; now we can proceed to merging the two datasets.
data_t.shape, data_c.shape
((246204, 8), (72637, 3))
# Merge the two datasets
data_final = pd.merge(data_t, data_c, on='lylty_card_nbr')
data_final.shape
(246204, 10)
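The row count being preserved (246,204) suggests every transaction found a matching loyalty card. pandas can enforce this during the merge with `validate` and `indicator`; a sketch on toy stand-ins for the two frames:

```python
import pandas as pd

# Stand-ins for the transaction and customer frames
t = pd.DataFrame({"lylty_card_nbr": [1000, 1000, 1002],
                  "tot_sales": [6.0, 3.8, 8.8]})
c = pd.DataFrame({"lylty_card_nbr": [1000, 1002],
                  "lifestage": ["Young families", "Retirees"]})

# validate="m:1" raises if a card number repeats in the customer table;
# indicator=True adds a _merge column flagging unmatched rows
merged = pd.merge(t, c, on="lylty_card_nbr", validate="m:1", indicator=True)
assert (merged["_merge"] == "both").all()  # no orphan transactions
merged = merged.drop(columns="_merge")
```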
# Save data
data_final.to_csv('data_cleaned_merged_v1.csv', index=False)
The goal in this section is to generate new columns (features) from the existing ones to enrich our dataset.
data_final = pd.read_csv('data_cleaned_merged_v1.csv')
# Let's create a unit price column
data_final['price'] = data_final.tot_sales / data_final.prod_qty
data_final.price.describe()
count 246204.000000 mean 3.831039 std 1.077786 min 1.320000 25% 3.000000 50% 3.800000 75% 4.600000 max 6.500000 Name: price, dtype: float64
# Let's take the first word of the product name as the brand name
data_final['brand'] = data_final.prod_name.apply(
lambda x: str(x.split(' ')[0]))
data_final.brand.unique()
array(['Natural', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',
'Infuzions', 'Grain', 'Doritos', 'Thins', 'Red', 'GrnWves',
'Tostitos', 'Pringles', 'Twisties', 'RRD', 'Infzns', 'Burger',
'NCC', 'Cheezels', 'Smith', 'Cobs', 'Dorito', 'Sunbites',
'Cheetos', 'Woolworths', 'French', 'Snbts'], dtype=object)
data_final.sample()
| date | store_nbr | lylty_card_nbr | txn_id | prod_nbr | prod_name | prod_qty | tot_sales | lifestage | premium_customer | price | brand | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 187558 | 2018-07-12 | 227 | 227173 | 229318 | 70 | Tyrrells Crisps Lightly Salted 165g | 2 | 8.4 | Young families | Budget | 4.2 | Tyrrells |
Looks like the brand names are not standardised. We have variants like Doritos and Dorito. We can manually map the names as there are not many.
# Map the correct brand names
data_final.brand.replace({'Natural': 'Naturals', 'Old': 'Old Dutch', 'Dorito': 'Doritos', 'Grain': 'Grain Waves',
                          'Red': 'RRD', 'GrnWves': 'Grain Waves', 'Infzns': 'Infuzions', 'Burger': 'Burger Rings', 'Smith': 'Smiths',
                          'Woolworths': 'WW', 'French': 'French Fries', 'Snbts': 'Sunbites'}, inplace=True)
# Debug
data_final.brand.unique()
array(['Naturals', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',
'Infuzions', 'Grain Waves', 'Doritos', 'Thins', 'RRD', 'Tostitos',
'Pringles', 'Twisties', 'Burger Rings', 'NCC', 'Cheezels', 'Cobs',
'Sunbites', 'Cheetos', 'French Fries'], dtype=object)
Cool, now we have added a brand column to our dataset by extracting it from the product names.
# Extract the packet weight (grams) from the product name
data_final['packet_size'] = data_final.prod_name.apply(
    lambda x: str(x.split(' ')))
# Extract only the numbers from the string as a list
data_final['packet_size'] = data_final.packet_size.apply(
    lambda x: re.findall('[0-9]+', x))
# Convert the list back to a single string
data_final['packet_size'] = data_final['packet_size'].apply(
    lambda x: ' '.join(map(str, x)))
# Debug
data_final['packet_size'].value_counts()
175 66222 150 40135 134 25052 110 22335 170 19952 165 15269 330 12480 380 6388 210 6265 270 6262 200 4473 135 3251 250 3158 90 3008 190 2994 160 2970 220 1564 70 1507 180 1465 125 1454 Name: packet_size, dtype: int64
# Change to number format
data_final['packet_size'] = pd.to_numeric(data_final['packet_size'],errors = 'coerce')
# Debug
data_final['packet_size'].value_counts().sort_index()
70 1507 90 3008 110 22335 125 1454 134 25052 135 3251 150 40135 160 2970 165 15269 170 19952 175 66222 180 1465 190 2994 200 4473 210 6265 220 1564 250 3158 270 6262 330 12480 380 6388 Name: packet_size, dtype: int64
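The three `apply` steps above can be collapsed into a single vectorised `str.extract` call (an equivalent sketch, assuming each product name contains exactly one number, the weight in grams):

```python
import pandas as pd

names = pd.Series(["Kettle Tortilla ChpsHny&Jlpno Chili 150g",
                   "Doritos Corn Chips Original 170g"])

# Pull the first run of digits out of each name and cast to a number
packet_size = pd.to_numeric(names.str.extract(r"(\d+)", expand=False))
print(packet_size.tolist())
```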
We have successfully extracted and converted the weight of each packet to a usable format.
We can use the packet size column to generate size segments; I will divide the sizes into small, medium, big, very big, large, and very large.
# Sizes box plot to understand how the sizes are distributed
sns.boxplot(x=data_final.packet_size, data= data_final);
The box plot shows that most products weigh between 150g and 175g; we will use that as the middle of our size categories. Let's get to coding.
# Create a function to assign each row to a class depending on packet weight
def packet_size(df):
    if (df['packet_size'] >= 70) and (df['packet_size'] < 101):
        return 'Small'
    elif (df['packet_size'] >= 101) and (df['packet_size'] < 151):
        return 'Medium'
    elif (df['packet_size'] >= 151) and (df['packet_size'] < 201):
        return 'Big'
    elif (df['packet_size'] >= 201) and (df['packet_size'] < 251):
        return 'Very_Big'
    elif (df['packet_size'] >= 251) and (df['packet_size'] < 301):
        return 'Large'
    else:
        return 'Very_Large'
# Create new packet_size_category using the above function
data_final['packet_size_category'] = data_final.apply(packet_size, axis=1)
# Debug
data_final.groupby('packet_size_category')[
'packet_size'].describe().sort_values(by='count', ascending=False)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| packet_size_category | ||||||||
| Big | 113345.0 | 173.827121 | 7.302311 | 160.0 | 170.0 | 175.0 | 175.0 | 200.0 |
| Medium | 92227.0 | 135.044000 | 15.846845 | 110.0 | 125.0 | 134.0 | 150.0 | 150.0 |
| Very_Large | 18868.0 | 346.928132 | 23.661670 | 330.0 | 330.0 | 330.0 | 380.0 | 380.0 |
| Very_Big | 10987.0 | 222.920724 | 17.527316 | 210.0 | 210.0 | 210.0 | 250.0 | 250.0 |
| Large | 6262.0 | 270.000000 | 0.000000 | 270.0 | 270.0 | 270.0 | 270.0 | 270.0 |
| Small | 4515.0 | 83.324474 | 9.432263 | 70.0 | 70.0 | 90.0 | 90.0 | 90.0 |
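The if/elif chain above can also be written with `pd.cut`, which is vectorised and keeps the bin edges in one place (a sketch reproducing the same breakpoints and labels):

```python
import pandas as pd

sizes = pd.Series([90, 134, 175, 210, 270, 330])

# Same breakpoints as the packet_size() function above; bins are right-inclusive
bins = [69, 100, 150, 200, 250, 300, 1000]
labels = ["Small", "Medium", "Big", "Very_Big", "Large", "Very_Large"]
category = pd.cut(sizes, bins=bins, labels=labels)
print(category.tolist())
```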
As the box plot suggested, the Medium and Big categories (110-200g) account for over 200,000 units sold, so there is already a visible preference for that size range. We will confirm that later in the analysis.
# Dtypes Check
data_final.dtypes
date object store_nbr int64 lylty_card_nbr int64 txn_id int64 prod_nbr int64 prod_name object prod_qty int64 tot_sales float64 lifestage object premium_customer object price float64 brand object packet_size int64 packet_size_category object dtype: object
# Set category data type
data_final = data_final.astype(
{"lifestage": 'category', "premium_customer": 'category', "brand": 'object'})
# Change date column to date Type
data_final.date=pd.to_datetime(data_final.date)
# Debug
data_final.dtypes
date datetime64[ns] store_nbr int64 lylty_card_nbr int64 txn_id int64 prod_nbr int64 prod_name object prod_qty int64 tot_sales float64 lifestage category premium_customer category price float64 brand object packet_size int64 packet_size_category object dtype: object
# Create Day of week and month
data_final['day_week'] = data_final.date.dt.day_name()
data_final['month'] = data_final.date.dt.month_name()
# Debug
data_final.sample(5)
| date | store_nbr | lylty_card_nbr | txn_id | prod_nbr | prod_name | prod_qty | tot_sales | lifestage | premium_customer | price | brand | packet_size | packet_size_category | day_week | month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218624 | 2018-07-15 | 188 | 188438 | 189817 | 70 | Tyrrells Crisps Lightly Salted 165g | 2 | 8.4 | Young singles/couples | Budget | 4.2 | Tyrrells | 165 | Big | Sunday | July |
| 14427 | 2018-11-29 | 231 | 231011 | 233715 | 47 | Doritos Corn Chips Original 170g | 2 | 8.8 | Midage singles/couples | Mainstream | 4.4 | Doritos | 170 | Big | Thursday | November |
| 53654 | 2018-12-03 | 241 | 241234 | 245867 | 86 | Cheetos Puffs 165g | 2 | 5.6 | Older families | Mainstream | 2.8 | Cheetos | 165 | Big | Monday | December |
| 237091 | 2019-06-26 | 220 | 220395 | 219951 | 21 | WW Sour Cream &OnionStacked Chips 160g | 2 | 3.8 | Young singles/couples | Mainstream | 1.9 | WW | 160 | Big | Wednesday | June |
| 68264 | 2019-02-20 | 58 | 58254 | 53925 | 89 | Kettle Sweet Chilli And Sour Cream 175g | 2 | 10.8 | Older families | Premium | 5.4 | Kettle | 175 | Big | Wednesday | February |
data_final.to_csv('data_final.csv', index=False)
Now that we have cleaned our data we are ready to start analysing.
We know that the client is interested in understanding her customers' purchasing patterns and finding out which customer groups to target to increase her sales. To get a better understanding of the object of analysis, the customers, let's rank the segments on the following metrics:
- Number of customers per segment
- Total revenue per segment
- Revenue per customer
- Average quantity bought per customer
- Average price paid per packet

After getting a clear picture from these metrics we can dig deeper. Let's get on with it.
# How many customers do we have in each segment ?
# Create the dataframe containing the count of each member
segment_count = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'],as_index=False)[
'lylty_card_nbr'].nunique().sort_values(
by = 'lylty_card_nbr',ascending=False))
# Create percent of total columns
segment_count['percent_total'] = ((segment_count.lylty_card_nbr / segment_count.lylty_card_nbr.sum())*100).round(2)
segment_count['percent_total'] = segment_count['percent_total'].astype(str) + ' %'
# Plot Number of customers per segment
barchart = px.bar(
data_frame = segment_count,
y = 'lylty_card_nbr',
x = 'lifestage',
color = 'premium_customer',
opacity = 0.9,
orientation ='v',
barmode = 'group',
text = 'percent_total',
title = 'Count of Customers by Segment',
labels = {'lylty_card_nbr':'Number of Customers','premium_customer':'Customer Shopping Segment'}
)
barchart
# Which segment brought in most revenue ?
# Create total sales by segment dataframe
segment_revenue = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'],as_index=False)[
'tot_sales'].sum().sort_values(
by = 'tot_sales',ascending=False))
# Create percent of total columns
segment_revenue['percent_total'] = ((segment_revenue.tot_sales / segment_revenue.tot_sales.sum())*100).round(2)
segment_revenue['percent_total'] = segment_revenue['percent_total'].astype(str) + ' %'
# Plot the segments
barchart = px.bar(
data_frame = segment_revenue,
y = 'tot_sales',
x = 'lifestage',
color = 'premium_customer',
opacity = 0.9,
orientation ='v',
barmode = 'group',
text = 'percent_total',
title = 'Total Sales by Segments',
labels = {'tot_sales':'Total Amount of Sales','premium_customer':'Customer Shopping Segment'},
)
barchart
# Create revenue per customer dataframe
revenue_member = segment_count.drop(['lylty_card_nbr','percent_total'],axis=1)
revenue_member['revenue_member'] = round(segment_revenue.tot_sales / segment_count.lylty_card_nbr,2)
# Plot revenue per customer
barchart = px.bar(
data_frame = revenue_member,
y = 'revenue_member',
x = 'lifestage',
color = 'premium_customer',
opacity = 0.9,
orientation ='v',
barmode = 'group',
text = 'revenue_member',
title = 'Revenue Per Customer by Segment',
labels = {'tot_sales':'Total Amount of Sales','premium_customer':'Customer Shopping Segment'}
)
barchart
# Create segment total quantity dataframe
segment_quantity = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'],as_index=False)[
'prod_qty'].sum().sort_values(
by = 'prod_qty',ascending=False))
# Create quantity per member column
segment_quantity['qty_member'] = round(segment_quantity.prod_qty / segment_count.lylty_card_nbr,2)
# Plot quantity per member
barchart = px.bar(
data_frame = segment_quantity,
y = 'qty_member',
x = 'lifestage',
color = 'premium_customer',
opacity = 0.9,
orientation ='v',
barmode = 'group',
text = 'qty_member',
title = 'Average Quantity per Customer by Segment',
labels = {'qty_member':'Average Quantity','premium_customer':'Customer Shopping Segment'}
)
barchart
# Create average price per customer database
average_price = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'],as_index=False)[
'price'].mean().sort_values(
by = 'price',ascending=False)).round(2)
# Plot average price per segment
barchart = px.bar(
data_frame = average_price,
y = 'price',
log_y =True,
x = 'lifestage',
color = 'premium_customer',
opacity = 0.9,
orientation ='v',
barmode = 'group',
text = 'price',
title = 'Average Price Paid per Packet',
labels = {'price':'Average Price','premium_customer':'Customer Shopping Segment'}
)
barchart
- Biggest Segments:
- Highest Revenue:
- Revenue per Customer:
- Average Number of Packets Bought:
- Average Price per Packet:
Our biggest segment is Young Singles/Couples (Mainstream). Since they are the majority we would expect them to bring in the most revenue, but interestingly the highest share of revenue (8.68%) comes from Older Families (Budget), who make up only 6.47% of customers.
Young Singles/Couples (Mainstream), who make up 11.1% of all customers, were responsible for 8.19% of total revenue, the second highest share. This can be explained by the fact that Older and Young families buy more packets, on average about 9 compared to about 5 for Young Singles/Couples (Mainstream).
On the other hand, Young Singles/Couples (Mainstream) pay more per packet, 4.06 on average, while Older Families (Budget) spend 3.74 on average.
We can conclude that Young Singles/Couples (Mainstream) are more prone to impulse purchases and, since they have fewer people in the household, buy more expensive brands in smaller quantities.
On the contrary, Older Families (Budget) buy cheaper brands but in larger quantities.
We can visualise that relationship by plotting average price against average quantity.
But first we have to test whether the average price difference is statistically significant before assuming it in our analysis.
We have selected Midage and Young Singles/Couples (both Mainstream) as our chip fans. Let's test these samples to see if the difference is significant; if it is, we will dig a bit deeper into what might cause the above-average spend.
# Create a DataFrame for the Chip Fans
chip_fans = pd.DataFrame(data_final.query(
'lifestage == ("Midage singles/couples", "Young singles/couples") & premium_customer == "Mainstream" '))
# Debug
chip_fans.lifestage.unique()
['Midage singles/couples', 'Young singles/couples'] Categories (2, object): ['Midage singles/couples', 'Young singles/couples']
Let's conduct a two-sample t-test between the control population (all segments except our chip fans) and our chip fans.
# Get index of chip fans sample
index = chip_fans.index
# Remove chip fans from the population
control = data_final.drop(index)
# Create experiment set
experiment = chip_fans
The two-sample t-test assumes the observations are independent and approximately normally distributed, and the standard version also assumes equal variances in the two groups. We can check the mean and skew of both datasets to see whether these assumptions are reasonable.
# Create control and experiment group and print skew and mean
control_mean_sales = control.groupby('lylty_card_nbr')['tot_sales'].mean()
experiment_mean_sales = experiment.groupby('lylty_card_nbr')[
'tot_sales'].mean()
skew_control = round(control_mean_sales.skew(), 3)
skew_experiment = round(experiment_mean_sales.skew(), 3)
print('Control mean:{} skew: {}\nExperiment mean:{} skew: {}'.format(
control_mean_sales.mean(), skew_control, experiment_mean_sales.mean(), skew_experiment))
Control mean:7.023056981996668 skew: -0.361 Experiment mean:7.426350719604131 skew: -0.356
The two groups have very similar skew, so the distributions look comparable. Let's proceed with the test. Our null hypothesis: there is no statistically significant difference in average sales between the two groups.
# Import the library
from scipy import stats
tstat, pval = stats.ttest_ind(control_mean_sales, experiment_mean_sales)
print(pval)
if pval < 0.05:
    print('We can reject the null hypothesis.')
else:
    print('We have failed to reject the null hypothesis.')
2.847236809420268e-81 We can reject the null hypothesis.
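One caveat: by default `ttest_ind` assumes the two groups share a variance. Welch's variant (`equal_var=False`) drops that assumption and is often a safer default; a sketch on synthetic data (not the notebook's samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=7.0, scale=2.0, size=500)
experiment = rng.normal(loc=7.4, scale=2.5, size=500)

# Welch's t-test: does not assume equal variances in the two groups
tstat, pval = stats.ttest_ind(control, experiment, equal_var=False)
print(f"t = {tstat:.3f}, p = {pval:.4f}")
```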
We have a p-value much lower than our alpha of 0.05, so we can conclude that the difference in average spend is not due to chance. We can now assume these segments are more likely to buy more expensive brands.
Let's look at the relationship between quantity sold and average price.
# Create average quantity vs average price dataframe
qty_price = pd.merge(average_price,segment_quantity)
#debug
qty_price.sample(3)
| lifestage | premium_customer | price | prod_qty | qty_member | |
|---|---|---|---|---|---|
| 11 | Midage singles/couples | Premium | 3.77 | 14327 | 6.05 |
| 2 | Retirees | Budget | 3.92 | 26783 | 6.11 |
| 10 | Older singles/couples | Mainstream | 3.81 | 32444 | 6.69 |
# Plot the scatter plot
scatter = px.scatter(
data_frame = qty_price,
x = 'price',
y = 'qty_member',
size = 'prod_qty',
color = 'lifestage',
hover_name = 'premium_customer',
title = 'Average Qty Vs Average Price'
)
scatter
The scatter plot shows some interesting insights. As we suspected, families tend to purchase larger quantities of cheaper chips, but this holds only for Older and Young families; New families do not follow the trend, showing a preference for quality over quantity (lower quantity, higher average price).
Also interesting is that Mainstream Young and Midage Singles/Couples are willing to pay more per packet than their Premium and Budget counterparts. We can also note that Singles/Couples buy more as they get older (Young S/C buy less on average than Midage S/C, and so on), possibly due to greater health awareness among younger demographics.
Let’s get to know better the segments that buy more than average and pay more than average, with the intent of finding ways to keep them buying and retain their business.
Quantity:
Price:
It's in the client's interest to retain and serve these segments well, so the store should keep the brands and sizes they buy most in stock. Let's figure out which those are.
We are going to create an affinity table for brand and packet size for each of the four segments. How do we build this affinity table? We can only gauge preference for a specific brand or size by comparing the proportion of total quantity that brand represents for the target segment against the same proportion for everyone else.
Example: take our Older Families segment. We compare how much the brand 'Kettle' contributes to total quantity sold for the target segment versus the population (everyone minus the target). We can imagine something like this:
Brand: Kettle
Target %: 0.2
Population %: 0.16
Affinity = 0.2 / 0.16 = 1.25, i.e. our target segment is 25% more likely to purchase Kettle than the population.
Let's code these steps into a function so we can apply them to all four segments without rewriting the code each time.
# Create Affinity Calculator Function
def affinity(target, population, metric):
    """
    Compare two populations' quantity shares to measure affinity towards
    a given metric. For the purposes of this project, metric must be
    either 'brand' or 'packet_size'.
    """
    # Total quantities for each group
    qty_target = target.prod_qty.sum()
    qty_population = population.prod_qty.sum()
    # Quantity totals per metric value
    target_qty_brand = pd.DataFrame(target.groupby(metric, as_index=False)['prod_qty'].sum())
    population_qty_brand = pd.DataFrame(population.groupby(metric, as_index=False)['prod_qty'].sum())
    # Share-of-total columns
    target_qty_brand['qty_total'] = target_qty_brand.prod_qty / qty_target
    population_qty_brand['qty_total'] = population_qty_brand.prod_qty / qty_population
    # Rename columns
    target_qty_brand.rename(columns={'qty_total': 'Target'}, inplace=True)
    population_qty_brand.rename(columns={'qty_total': 'Population'}, inplace=True)
    # Drop unnecessary columns
    target_qty_brand.drop('prod_qty', axis=1, inplace=True)
    population_qty_brand.drop('prod_qty', axis=1, inplace=True)
    # Build the affinity dataframe
    affinity = pd.merge(target_qty_brand, population_qty_brand)
    # Affinity ratio: target share over population share
    affinity['Proportion'] = affinity.Target / affinity.Population
    return affinity.nlargest(5, 'Proportion')
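A quick sanity check of the share-ratio logic on toy data (a stand-alone sketch mirroring what affinity() computes, with made-up quantities):

```python
import pandas as pd

# Toy quantities: the target buys Kettle half the time, the population a quarter
target = pd.DataFrame({"brand": ["Kettle", "Smiths", "Kettle", "Smiths"],
                       "prod_qty": [1, 1, 1, 1]})
population = pd.DataFrame({"brand": ["Kettle", "Smiths"],
                           "prod_qty": [1, 3]})

# Share of total quantity per brand, for each group
t_share = target.groupby("brand")["prod_qty"].sum() / target["prod_qty"].sum()
p_share = population.groupby("brand")["prod_qty"].sum() / population["prod_qty"].sum()

# Affinity ratio: values above 1 mean the target over-indexes on that brand
ratio = t_share / p_share
print(ratio["Kettle"])  # the target is twice as likely to pick Kettle
```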
# Create older families dataset
older_families = data_final.query('lifestage == "Older families"')
#Debug
older_families.lifestage.unique()
['Older families'] Categories (1, object): ['Older families']
# Create Younger Families dataset
young_families = data_final.query('lifestage == "Young families"')
#Debug
young_families.lifestage.unique()
['Young families'] Categories (1, object): ['Young families']
# Create Target and Population DataSets
target = older_families
index = older_families.index
population = data_final.drop(index)
# Get affinity table for brand
affinity(target,population,'brand')
| brand | Target | Population | Proportion | |
|---|---|---|---|---|
| 10 | NCC | 0.007039 | 0.005391 | 1.305685 |
| 0 | Burger Rings | 0.007794 | 0.006014 | 1.296110 |
| 20 | WW | 0.058155 | 0.045325 | 1.283070 |
| 13 | RRD | 0.079902 | 0.062633 | 1.275710 |
| 2 | Cheetos | 0.013838 | 0.011339 | 1.220353 |
# Get affinity table for packet size
affinity(target,population,'packet_size')
| packet_size | Target | Population | Proportion | |
|---|---|---|---|---|
| 11 | 180 | 0.007440 | 0.005515 | 1.349092 |
| 15 | 220 | 0.007794 | 0.006014 | 1.296110 |
| 13 | 200 | 0.022079 | 0.017066 | 1.293680 |
| 12 | 190 | 0.014605 | 0.011539 | 1.265695 |
| 7 | 160 | 0.014124 | 0.011481 | 1.230196 |
Older families are about 30% more likely to buy NCC, Burger Rings, WW and RRD.
This segment is also around 30% more likely to buy packets sized between 180 and 220 grams.
# Create Target and Population DataSets
target = young_families
index = young_families.index
population = data_final.drop(index)
# Get brand affinity table young families
affinity(target,population,'brand')
| brand | Target | Population | Proportion | |
|---|---|---|---|---|
| 15 | Sunbites | 0.015027 | 0.011588 | 1.296693 |
| 6 | French Fries | 0.006925 | 0.005392 | 1.284340 |
| 1 | CCs | 0.022220 | 0.017629 | 1.260430 |
| 13 | RRD | 0.077783 | 0.063467 | 1.225560 |
| 2 | Cheetos | 0.013683 | 0.011429 | 1.197148 |
Young families, meanwhile, are almost 30% more likely to buy Sunbites, French Fries and CCs. What about size?
# Get size affinity table young families
affinity(target,population,'packet_size')
| packet_size | Target | Population | Proportion | |
|---|---|---|---|---|
| 1 | 90 | 0.015027 | 0.011588 | 1.296693 |
| 12 | 190 | 0.014335 | 0.011665 | 1.228890 |
| 13 | 200 | 0.020876 | 0.017426 | 1.197964 |
| 15 | 220 | 0.007347 | 0.006146 | 1.195475 |
| 11 | 180 | 0.006771 | 0.005694 | 1.189099 |
Young families are about 30% more likely to buy 90g packets; the smaller packets may be bought for the children in the family.
What brands sell 90g packets?
# Brands that sell 90g packets.
data_final.query('packet_size == "90"').prod_name.unique()
array(['Sunbites Whlegrn Crisps Frch/Onin 90g',
'Snbts Whlgrn Crisps Cheddr&Mstrd 90g'], dtype=object)
We can conclude that young families are more likely to buy brands that offer smaller packets, probably because they buy chips as a snack for children.
Let's do the same for the Mainstream Midage and Young singles/couples segments.
# Create Midage S/C Dataset
midage = data_final.query('lifestage == "Midage singles/couples" & premium_customer == "Mainstream"')
#Debug
midage.lifestage.unique()
['Midage singles/couples'] Categories (1, object): ['Midage singles/couples']
# Create Young S/C Dataset
young = data_final.query('lifestage == "Young singles/couples" & premium_customer == "Mainstream"')
#Debug
young.lifestage.unique()
['Young singles/couples'] Categories (1, object): ['Young singles/couples']
# Create Target and Population DataSets
target = midage
index = midage.index
population = data_final.drop(index)
# Get affinity table for brand
affinity(target,population,'brand')
| | brand | Target | Population | Proportion |
|---|---|---|---|---|
| 18 | Twisties | 0.044208 | 0.038110 | 1.160022 |
| 9 | Kettle | 0.190779 | 0.166476 | 1.145986 |
| 17 | Tostitos | 0.043592 | 0.038240 | 1.139960 |
| 4 | Cobs | 0.044635 | 0.039200 | 1.138666 |
| 3 | Cheezels | 0.019970 | 0.018458 | 1.081924 |
# Get affinity table for packet size
affinity(target,population,'packet_size')
| | packet_size | Target | Population | Proportion |
|---|---|---|---|---|
| 17 | 270 | 0.030927 | 0.025266 | 1.224032 |
| 18 | 330 | 0.060099 | 0.050349 | 1.193635 |
| 2 | 110 | 0.101983 | 0.090500 | 1.126883 |
| 5 | 135 | 0.014610 | 0.013148 | 1.111181 |
| 14 | 210 | 0.027891 | 0.025372 | 1.099306 |
# Create Target and Population DataSets
target = young
index = young.index
population = data_final.drop(index)
# Get affinity table for brand
affinity(target,population,'brand')
| | brand | Target | Population | Proportion |
|---|---|---|---|---|
| 19 | Tyrrells | 0.031685 | 0.025683 | 1.233708 |
| 18 | Twisties | 0.046127 | 0.037738 | 1.222308 |
| 5 | Doritos | 0.122969 | 0.100876 | 1.219015 |
| 17 | Tostitos | 0.045462 | 0.037898 | 1.199601 |
| 9 | Kettle | 0.196984 | 0.165114 | 1.193019 |
# Get affinity table for size
affinity(target,population,'packet_size')
| | packet_size | Target | Population | Proportion |
|---|---|---|---|---|
| 17 | 270 | 0.031823 | 0.024995 | 1.273192 |
| 19 | 380 | 0.032128 | 0.025463 | 1.261783 |
| 18 | 330 | 0.061291 | 0.049911 | 1.227991 |
| 4 | 134 | 0.119366 | 0.100626 | 1.186231 |
| 2 | 110 | 0.106365 | 0.089735 | 1.185316 |
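The target/population setup repeated for each segment above could be wrapped in a small helper to cut the boilerplate. A sketch — the `segment_affinity` name and the inlined ratio logic are mine, assuming `data_final`'s lowercase column names:

```python
import pandas as pd

def segment_affinity(df, lifestage, premium, column):
    # Split df into one lifestage/premium segment (target) and
    # everyone else (population), then compute the affinity ratio.
    target = df.query('lifestage == @lifestage & premium_customer == @premium')
    population = df.drop(target.index)
    t = target[column].value_counts(normalize=True)
    p = population[column].value_counts(normalize=True)
    table = pd.DataFrame({'Target': t, 'Population': p})
    table['Proportion'] = table['Target'] / table['Population']
    table.index.name = column
    return table.sort_values('Proportion', ascending=False).reset_index()
```

With this, each of the tables above is a one-liner, e.g. `segment_affinity(data_final, 'Young singles/couples', 'Mainstream', 'brand')`.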
It seems we have uncovered preferences in all the segments. Let's list the top preferences for brand and size:
Older Families:
Brand: 30% more likely to buy NCC & Burger Rings.
Size: 30% more likely to buy packets sized 180g, 200g & 220g.
Young Families:
Brand: 30% more likely to buy Sunbites & French Fries.
Size: 30% more likely to buy the smallest packs, 90g to be exact.
Midage S/C:
Brand: 16% more likely to buy Twisties and 14% to buy Kettle.
Size: 22% more likely to buy 270g packs and 19% 330g.
Young S/C:
Brand: 23% more likely to buy Tyrrells and 22% Twisties.
Size: 27% more likely to buy 270g packs and 26% 380g.
We now have the stats for our selected segments. What do these brands have in common? Can we see why these segments have these preferences? Let's look at the common properties of these brands.
# Create list of brands to select ('Bruger Rings' matches the spelling used in the data)
brands = ['NCC','Bruger Rings','Sunbites','French Fries','Twisties','Kettle','Tyrrells']
# Create dataframe only with selected brands
selected_brands = data_final[data_final['brand'].isin(brands)]
# Debug
selected_brands.brand.unique()
array(['Kettle', 'Tyrrells', 'Twisties', 'Bruger Rings', 'NCC',
'Sunbites', 'French Fries'], dtype=object)
# Create dataset to plot
size_by_price = selected_brands.groupby('brand', as_index=False)[
['packet_size', 'price']].mean().round(2)
# Plot the scatter plot
scatter = px.scatter(
data_frame=size_by_price,
x='price',
y='packet_size',
color='brand',
size='packet_size',
title='Brands Average Size Vs Average Price',
labels={'packet_size': 'Average Pack Size'},
template='plotly_white'
)
scatter
Let’s recap what we’ve found! Sales have mainly been driven by Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees. We found that the high spend on chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers.
Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips. This is indicative of impulse buying behaviour.
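The "pay more per packet" claim can be double-checked with a quick groupby. A hedged sketch, assuming `data_final` carries the lowercase `lifestage`, `premium_customer` and `price` columns used earlier in the analysis:

```python
import pandas as pd

def avg_price_per_packet(df):
    # Mean unit price paid per (lifestage, premium) segment; higher
    # values for the Mainstream singles/couples rows would support
    # the impulse-buying interpretation.
    return (df.groupby(['lifestage', 'premium_customer'])['price']
              .mean()
              .sort_values(ascending=False))
```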
We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near areas that young singles and couples frequent, to increase visibility and impulse buying. Quantium can help the Category Manager with recommendations on where these segments are and further help them with measuring the impact of the changed placement. We’ll work on measuring the impact of trials in the next project.
What have we found out?
Most sales come from Older Families (Budget), Young Singles/Couples (Mainstream) & Retirees (Mainstream). Older Families, Young Families & Older Singles/Couples bring in the most revenue per customer. Young Singles/Couples (Mainstream) & Midage Singles/Couples (Mainstream) pay more per chip packet. Older families and Young families buy more packets of chips on average. Older and Young families are about 30% more likely to buy brands that cost under 3.00, and young families are more likely to buy small packets. Midage and Young Singles/Couples are about 20% more likely to buy more expensive brands, above 4.00.
Our recommendations: